Abstract

This paper concerns algorithms that "learn" functions from examples. Functions on strings of a finite
alphabet are considered and the notion of dimensionality defined for families of such functions. Using this
notion, a theorem is proved identifying the most general conditions under which a family of functions can
be efficiently learned from examples. Turning to some familiar families, we present strong evidence
against the existence of efficient algorithms for learning the regular functions and the polynomial-time
computable functions, even if the size of the encoding of the function to be learned is given. Our
arguments hinge on a new complexity measure - the constraint complexity.
1. Introduction

This paper concerns algorithms that "learn" functions from examples. In the main, it is a sequel to
the material in [Natarajan 1987] and contains the results presented in [Natarajan 1987b]. The problem
has been of interest over the years to workers in artificial intelligence, pattern recognition and numerical
analysis. Specifically, we are interested in computing uniformly good approximations to an unknown
function, based on its behaviour on a few sample points. This problem is known as interpolation in
numerical analysis, pattern matching in pattern recognition and concept learning (amongst others) in
artificial intelligence. As our motivation for this study was drawn from artificial intelligence, we will use the
term "learning" instead of the other two.

We  begin  with  an  example  to  motivate  our  work.  Consider  the  problem  of  learning  integral  calculus. 
Given  a  table  of  integrals,  one  has  all  the  information  theoretically  required  to  become  an  expert.  Yet, 
worked  examples  and  practice  problems  seem  to  be  necessary  before  one  acquires  any  facility  over  the 
domain.  We  formalize  this  problem  and  show  that  unless  P  =  NP,  examples  play  an  important  role  in  such 
learning.  Our  formalism  covers  many  other  domains  such  as  learning  to  solve  puzzles,  play  games  etc. 
We  then  argue  that  it  is  convenient  to  view  our  formalism  as  an  algorithm  that  learns  functions  from 
examples. 

The  problem  of  inferring  Turing  machines  from  sample  computation  traces  has  been  studied  before 
[Biermann  1974],  but  issues  of  feasibility  or  correctness  have  not  been  addressed.  More  recently,  a 
general framework for uniformly convergent learning of simple concepts was proposed [Valiant 1984].
Based  on  this  framework,  some  general  results  on  learning  geometric  concepts  and  boolean  functions 
followed  [Blumer  et  al.  1986,  Natarajan  1987].  Within  the  same  framework,  we  consider  length  preserving 
functions  on  strings  of  a  finite  alphabet.  We  define  the  notion  of  dimensionality  for  families  of  such 
functions  and  give  a  general  theorem  that  states  that  a  family  of  such  functions  can  be  efficiently  learned 
if  and  only  if  it  is  of  polynomial  dimension.  This  is  an  important  contribution  of  the  paper.  Our  approach  is 
similar  to  the  one  in  [Natarajan  1987]  and  aims  at  ease  of  understanding  and  intuitive  appeal.  Turning  to 
functions  on  continuous  spaces,  we  extend  the  results  on  learning  boolean-valued  functions  [Blumer  et  al. 
1986]  to  general  functions. 

We then consider two familiar function families: the regular functions and the polynomial-time
computable  functions.  Since  these  families  are  not  of  polynomial  dimension,  we  consider  parametrized 
subsets  of  these  families,  the  parameter  being  the  bound  on  the  size  of  the  encodings  of  the  functions. 
We  measure  the  encoding  size  as  the  number  of  states  in  the  deterministic  finite  automaton  computing 
the  function  for  regular  functions,  and  as  the  size  of  the  program  in  some  admissible  programming  system 
for  the  polynomial-time  computable  functions.  We  then  look  for  learning  algorithms  that  run  in  time 
polynomial in the size bound. (Summarizing the above, when attempting to learn an unknown function, is
it  sufficient  to  know  that  the  function  is  regular  (or  polynomial-time  computable)  and  that  it  has  a  short 
encoding,  in  order  to  learn  it  efficiently?) 

For the regular functions, we show that such an algorithm does not exist, unless NP = RP. Our
argument is based on an earlier result on the complexity of ordering the regular sets [Gold 1978, Angluin
1978].

For  the  polynomial-time  computable  functions,  we  argue  that  it  is  unlikely  that  such  an  algorithm 
exists. Our argument is not reducible to the condition "unless NP = RP", but is almost as strong, and
proceeds  as  follows.  To  start  with,  we  introduce  the  interesting  notion  of  the  Constraint  Complexity  of  a 
set  of  examples  -  a  measure  of  the  information  carried  by  the  set.  This  is  the  second  important 
contribution  of  the  paper.  As  a  backdrop,  we  prove  many  interesting  results  with  this  tool,  including  a 
short  and  intuitive  proof  of  the  dimensionality  theorem  mentioned  earlier.  We  then  argue  that  since  the 
traditional  notion  of  Kolmogorov  complexity  is  a  special  case  of  our  notion  and  there  are  no  known 
algorithms  for  efficiently  computing  the  polynomial-time  bounded  Kolmogorov  complexity,  it  is  unlikely  that 
we  can  construct  one  for  our  measure.  From  this  we  deduce  that  an  efficient  learning  algorithm  for  the 
polynomial-time  functions  is  rather  unlikely. 

2.  Problem  Solving:  An  Example 

Many  problems  such  as  learning  integral  calculus,  learning  to  solve  puzzles,  games  etc  can  be 
expressed  as  follows. 

A problem domain D is the triplet [L, M, N] where

(a) L, the problem set, is any set of strings.

(b) M is a finite and fixed set of operators where each $m_i$ in M is a function from L to L.

(c) N is the goal predicate, a boolean-valued function on L. A problem p in L is solved if N(p) = 1.

If  D  were  the  domain  of  integral  calculus,  L  would  be  all  integrals,  M  a  table  of  integrals,  and  N  the 
rule  "problem  is  solved  iff  it  does  not  contain  integral  signs". 

A solution of any problem p is a(p), where a is any sequence of operators from M such that N(a(p)) =
1.  A  problem  solver  for  a  domain  is  an  algorithm  that  takes  as  input  a  problem  and  produces  as  output  a 
solution  of  the  problem. 

Our  interest  is  to  construct  a  meta-algorithm  for  any  given  set  of  domains  H,  that  would  take  as  input 
a domain D from H and, after some pre-computation, behave like a problem-solver for D. We now show
that if P ≠ NP, even the simplest of domains will not possess an efficient meta-algorithm, unless the
meta-algorithm  is  allowed  to  see  solved  examples  for  its  input  domains. 

Example: Consider the set of domains H defined as follows. Any D = [L, M, N] in H is such that

(a) the problem set $L = \{x\#y \mid x, y \in \{0,1\}^*\}$, where # is a special symbol.

(b) the operator set $M = \{m_1, m_2, m_3\}$, where for $x, y \in \{0,1\}^*$ and $a \in \{0,1\}$,

$m_1(x\#ay) = x0\#y$,
$m_2(x\#ay) = x1\#y$,
$m_3(x\#y) = \#xy$.

(c) N is a boolean function constructed as follows. Let N' be a boolean function of n variables
and let $p \in L$.

N(p) = the value of N' on the boolean vector obtained by stripping off the # from p, if |p| = n+1,

= 0 otherwise.

In  essence,  each  domain  in  H  is  characterized  by  the  boolean  function  that  is  its  goal  predicate.  Let 
D = [L, M, N] in H and let N' be a function of n variables. Now, if a is a satisfying assignment of N', then a is
a solution to every problem of length n+1 in L. If N' is not satisfiable, then no problem in L has a solution.
Hence,  every  domain  in  H  trivially  has  a  problem-solver,  but  a  meta-algorithm  on  H  is  going  to  have  to 
decide  on  the  satisfiability  of  boolean  formulae.  Clearly,  an  intractable  problem.  On  the  other  hand,  if  the 
meta-algorithm  is  allowed  to  see  solved  examples  for  the  input  domain,  then  it  can  trivially  decide  whether 
or  not  the  goal  predicate  has  a  satisfying  assignment,  and  then  act  accordingly.  • 
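
The example domain can be made concrete with a short Python sketch. The function names (`m1`, `m2`, `m3`, `make_goal`, `apply_assignment`) and the decision to leave a problem unchanged when no symbol follows `#` are assumptions made for illustration; they are not part of the formalism above.

```python
from typing import Callable, Sequence


def m1(p: str) -> str:
    """m1(x#ay) = x0#y: overwrite the symbol after '#' with 0 and advance '#'."""
    x, rest = p.split("#", 1)
    # Assumption: when nothing follows '#', the operator leaves p unchanged.
    return x + "0#" + rest[1:] if rest else p


def m2(p: str) -> str:
    """m2(x#ay) = x1#y: overwrite the symbol after '#' with 1 and advance '#'."""
    x, rest = p.split("#", 1)
    return x + "1#" + rest[1:] if rest else p


def m3(p: str) -> str:
    """m3(x#y) = #xy: move '#' back to the front."""
    x, y = p.split("#", 1)
    return "#" + x + y


def make_goal(n_prime: Callable[[Sequence[bool]], bool], n: int) -> Callable[[str], bool]:
    """Build the goal predicate N from a boolean function N' of n variables."""
    def goal(p: str) -> bool:
        if len(p) != n + 1:
            return False
        bits = [c == "1" for c in p.replace("#", "")]
        return bool(n_prime(bits))
    return goal


def apply_assignment(p: str, assignment: Sequence[bool]) -> str:
    """Rewind with m3, then write the assignment bit by bit with m1 / m2."""
    q = m3(p)
    for bit in assignment:
        q = m2(q) if bit else m1(q)
    return q
```

Under these assumptions, a satisfying assignment of N' acts as one operator sequence that solves every problem of length n+1, which is the observation the example relies on.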

If  the  meta-algorithm  is  allowed  to  see  a  few  examples,  say  pairs  of  the  form  (problem,  solution),  and 
then  be  required  to  compute  the  function  that  maps  each  problem  to  its  solution,  the  entire  process  can 
be  viewed  as  learning  a  good  approximation  to  a  function  from  examples  of  its  behaviour.  This  is  exactly 
the  problem  we  study  below. 

3.  Preliminaries 

Without loss of generality, let $\Sigma$ be the binary alphabet and $\Sigma^*$ the set of all binary strings. We
consider functions from $\Sigma^*$ to $\Sigma^*$. An example of a function f is a pair (x, f(x)). A learning algorithm is an
algorithm that attempts to infer a function from examples for it. The learning algorithm has at its disposal
a routine EXAMPLE, that at each call produces an example for the function to be learned. The probability
that a particular example (x, y) will be produced by a call of EXAMPLE is P(x), as given by the probability
distribution P. Also, the probability that the learned function will be queried on a particular string x is P(x).
The distribution P can be arbitrary and unknown.
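
As an illustration, the routine EXAMPLE can be sketched in Python as a closure over a target function and a distribution. The name `make_example`, the dictionary representation of P, and the fixed seed are assumptions made for this sketch only.

```python
import random
from typing import Callable, Dict, Tuple


def make_example(f: Callable[[str], str],
                 dist: Dict[str, float],
                 seed: int = 0) -> Callable[[], Tuple[str, str]]:
    """Return a zero-argument EXAMPLE routine for the target function f.

    dist maps each string x to its probability P(x); the learner never
    inspects dist directly, it only calls the returned routine.
    """
    rng = random.Random(seed)
    strings = list(dist)
    weights = [dist[x] for x in strings]

    def example() -> Tuple[str, str]:
        # Draw x according to P and return the pair (x, f(x)).
        x = rng.choices(strings, weights=weights, k=1)[0]
        return x, f(x)

    return example
```

A learner written against this interface sees only pairs (x, f(x)), matching the requirement that P be arbitrary and unknown.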

We define a family of functions F to be any set of length preserving functions from $\Sigma^*$ to $\Sigma^*$. The
nth-subfamily $F_n$ of a family F is the family of functions induced by F on $\Sigma^n$. Specifically, if
$F = \{f_1, f_2, \ldots, f_i, \ldots\}$, then $F_n = \{g_1, g_2, \ldots, g_i, \ldots\}$, where $g_i$ is defined as follows.

$g_i(x) = f_i(x)$ if $|x| = n$,

undefined otherwise.

4.  Uniformly  Convergent  Learning 
4.1  Learnability 

Following  [Valiant  1984],  we  say  that  a  family  of  functions  is  learnable  if  there  exists  a  uniformly 
convergent  learning  algorithm  for  it.  Specifically,  a  family  of  functions  F  is  learnable  if  there  exists  a 
learning  algorithm  that 

(a)  takes  as  input  integers  n  and  h. 


(b) makes polynomially many calls of EXAMPLE, both in the adjustable error parameter h and in the
problem size n. EXAMPLE produces examples of some function in $F_n$.

(c) for all functions f in $F_n$ and all probability distributions P on $\Sigma^n$, with probability (1 - 1/h) the
algorithm outputs a function g in $F_n$ such that

$$\sum_{x \in S} P(x) \le \frac{1}{h},$$

where $S = \{x \mid |x| = n \text{ and } f(x) \ne g(x)\}$.

Furthermore, if the learning algorithm runs in time polynomial in n and h, we say that the family is
polynomial-time learnable.
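
The error condition in (c) is simply the probability mass of the strings on which the learned function and the target disagree. A minimal Python sketch, assuming a finite distribution stored as a dictionary of exact fractions (the names `error` and `within_tolerance` are hypothetical):

```python
from fractions import Fraction
from typing import Callable, Dict


def error(f: Callable[[str], str],
          g: Callable[[str], str],
          dist: Dict[str, Fraction]) -> Fraction:
    """Total probability of the set S = {x : f(x) != g(x)}."""
    return sum((p for x, p in dist.items() if f(x) != g(x)), Fraction(0))


def within_tolerance(f: Callable[[str], str],
                     g: Callable[[str], str],
                     dist: Dict[str, Fraction],
                     h: int) -> bool:
    """True when g approximates f to within the error parameter 1/h."""
    return error(f, g, dist) <= Fraction(1, h)
```

Larger h means a stricter tolerance, which is why the number of calls to EXAMPLE is allowed to grow polynomially in h.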

We  need  the  following  definitions  as  well. 

A function f is consistent with a set of examples S if $(x, y) \in S$ implies f(x) = y.

An ordering of a sub-family $F_n$ is an inclusive, onto mapping from sets of examples to $F_n$.
Specifically,

(a) $O_n : 2^{\Sigma^n \times \Sigma^n} \to F_n$.

(b) inclusive: For any $S \subseteq \Sigma^n \times \Sigma^n$, if there exists $f \in F_n$ consistent with S, then $O_n(S)$ is defined
and is consistent with S.

(c) onto: For all f in $F_n$, there exists $S \subseteq \Sigma^n \times \Sigma^n$ such that $O_n(S) = f$.

An ordering O of a family F is a sequence of sub-orderings $O_1, O_2, \ldots$ such that $O_n$ is an ordering
of $F_n$, the nth-subfamily of F. An ordering O is a polynomial-time ordering if there exists a polynomial T(n)
such that each sub-ordering $O_n$ of O runs in time T(n) on inputs of length n.

The width of an ordering O of a sub-family $F_n$ is the least integer w such that for all f in $F_n$ there exists
a set S of w or fewer examples for which O(S) = f.

The dimension of a sub-family $F_n$ is the least integer d for which there exists an ordering of $F_n$ of
width d. A family F is of dimension D(n) if there exists an ordering O of F such that for all n, the
nth sub-ordering $O_n$ of O orders $F_n$ in width D(n) or less. If D(n) is a polynomial in n, F is said to be of
polynomial dimension and O of polynomial width.

A set S of examples is shattered by a family F if for any $S_1 \subseteq S$ there exists $f \in F$ such that f is
consistent with $S_1$ but not consistent with any non-trivial subset of $S - S_1$.

Remark: If $|F_n| \ge 2^k$ for some k, then the dimension of $F_n$ is at least k/(2n).
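
For very small families the definitions of consistency and shattering can be checked by brute force. The sketch below is only illustrative: the names `consistent` and `shattered` are hypothetical, and "not consistent with any non-trivial subset" is read as "disagrees with every example in the remainder", which is an interpretation rather than something the text spells out.

```python
from itertools import combinations
from typing import Callable, Iterable, Sequence, Tuple

Example = Tuple[str, str]


def consistent(f: Callable[[str], str], examples: Iterable[Example]) -> bool:
    """f is consistent with a set of examples if f(x) = y for every (x, y)."""
    return all(f(x) == y for x, y in examples)


def shattered(examples: Sequence[Example],
              family: Sequence[Callable[[str], str]]) -> bool:
    """Check every split S1 / S - S1 for a witness function in the family."""
    examples = list(examples)
    for r in range(len(examples) + 1):
        for s1 in combinations(examples, r):
            rest = [e for e in examples if e not in s1]
            # A witness agrees with all of S1 and disagrees with all of the rest.
            if not any(consistent(f, s1) and all(f(x) != y for x, y in rest)
                       for f in family):
                return False
    return True
```

The search is exponential in the number of examples, so it is useful only as a sanity check on toy families.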

We  are  now  ready  for  our  first  result. 

Lemma 1: Let $F_n$ be a subfamily of dimension d. Then there exists a set S of d examples that is
shattered by $F_n$.


Proof: Let O be an ordering for $F_n$. We first modify O to obtain O' as follows.

function O'(G: set of examples)
    Let $C_1, C_2, \ldots, C_i, \ldots$ be sets of examples in
    increasing size and in some canonical order.
    for $C_1, C_2, \ldots$ do
        if $O(C_i)$ is consistent with G
        then return $O(C_i)$.
    od
end

It is easy to see O' is an ordering for $F_n$ as well. Pick a function f in $F_n$ such that

$\forall S: O'(S) = f$ implies $|S| \ge d$.

Let S be a set such that f = O'(S). Now $|S| \ge d$. Suppose there exists a set $S_1 \subseteq S$ such that any g in $F_n$
consistent with $S_1$ is also consistent with some non-trivial subset of $S - S_1$. Then, $O'(S_1) = O'(S_2)$ for some
$S_1 \subset S_2 \subset S$. Modify O' to O'' as follows.

O''(G) = O'(S) if $G = S_2$,

O''(G) = O'(G) otherwise.

Now O'' is also an ordering of $F_n$ except that there is now a set $S_2$, $|S_2| < |S|$, such that $O''(S_2) = f$. We can
repeat this process for other functions in $F_n$, eventually reducing the width of the ordering. Since the
width cannot be reduced below d, there must be some set of size d or greater that is shattered by $F_n$,
which implies that there is a set of size d shattered by $F_n$. Hence the lemma. •
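
The modified ordering used in the proof of Lemma 1 can be sketched in Python for a finite pool of candidate examples. This is a sketch under stated assumptions: `canonical_ordering` is a hypothetical name, the canonical order is taken to be "by size, then lexicographic", and the base ordering is assumed to return `None` where it is undefined.

```python
from itertools import combinations
from typing import Callable, Iterable, Optional, Sequence, Tuple

Example = Tuple[str, str]
Function = Callable[[str], str]
Ordering = Callable[[Sequence[Example]], Optional[Function]]


def canonical_ordering(base: Ordering, pool: Iterable[Example]) -> Ordering:
    """Wrap a base ordering so that it always answers with the image of the
    first (smallest) candidate set whose image is consistent with the input."""
    candidates = sorted(pool)

    def modified(g_examples: Sequence[Example]) -> Optional[Function]:
        for size in range(len(candidates) + 1):
            for c in combinations(candidates, size):
                f = base(c)
                if f is not None and all(f(x) == y for x, y in g_examples):
                    return f
        return None  # no function in the family is consistent with the input

    return modified
```

Because candidate sets are scanned in increasing size, each function is reached from the smallest set that the base ordering maps to it, which is the property the proof exploits.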

Corollary: $|F_n| \ge 2^k$ for some k implies that there exists a set S, $|S| \ge k/(2n)$, that is shattered by $F_n$. •

We  are  now  ready  for  our  main  theorem. 

Theorem 1: A family of functions F is learnable if and only if it is of polynomial dimension.

Proof: (If) Let O be an ordering for F of width D(n), where D(n) is some polynomial in n. The
following is a learning algorithm for F.

Algorithm 1

Input: n, h.
begin
    Call EXAMPLE 2hn(D(n)+1) times.
    Let S be the set of examples obtained.
    Output O(S).
end

Algorithm 1 is correct as reasoned below. Let f in $F_n$ be the function to be learned, i.e., the function
for which EXAMPLE provides examples, and let P be the probability distribution on $\Sigma^n$. For any g in $F_n$,
define the residue of g as follows.

$$r_g = \sum_{x \in S_g} P(x),$$

where $S_g = \{x \mid g(x) \ne f(x)\}$.
Let $C_f$ be the set given by

$$C_f = \{g \mid g \in F_n \text{ and } r_g > 1/h\},$$

i.e., $C_f$ is the set of functions in $F_n$ that differ from the function to be learned with probability exceeding
1/h. The probability that Algorithm 1 outputs a function from $C_f$ should be bounded by 1/h. The
probability that m calls of EXAMPLE will produce examples all consistent with some particular function in
$C_f$ is bounded by $(1 - 1/h)^m$. Now,

$$|C_f| \le |F_n| \le 2^{2nD(n)}.$$

Hence the probability that m calls of EXAMPLE will produce examples all consistent with any one function
in $C_f$ is bounded by $2^{2nD(n)}(1 - 1/h)^m$. Therefore, if m satisfies

$$2^{2nD(n)}(1 - 1/h)^m \le 1/h$$

and Algorithm 1 calls EXAMPLE m times, Algorithm 1 will be within the allowable error with high
probability. Simplifying, we get

$$m \ge h(2nD(n) + \log(n)),$$

which is satisfied for m = 2hn(D(n)+1) as in Algorithm 1. Hence, Algorithm 1 learns F and since D(n) is
polynomial in n, F is learnable.
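
Algorithm 1 translates almost line for line into Python. In this sketch the parameter names (`dimension`, `example`, `ordering`) are assumptions: `dimension` stands for D(n), `example` for the routine EXAMPLE, and `ordering` for the ordering O applied to a set of examples.

```python
from typing import Callable, Set, Tuple, TypeVar

Example = Tuple[str, str]
Hypothesis = TypeVar("Hypothesis")


def algorithm_1(n: int,
                h: int,
                dimension: Callable[[int], int],
                example: Callable[[], Example],
                ordering: Callable[[Set[Example]], Hypothesis]) -> Hypothesis:
    """Draw 2hn(D(n)+1) examples and return the ordering's answer on them."""
    m = 2 * h * n * (dimension(n) + 1)
    # Repeated draws collapse, since S is a set of examples.
    sample = {example() for _ in range(m)}
    return ordering(sample)
```

All of the computational work is delegated to the ordering, which is why the time-bounded version of the theorem turns on how efficiently the ordering can be computed.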




(only if) Let F be of super-polynomial dimension D(n) and let A claim to be a learning algorithm for F.
Let A call EXAMPLE $(nh)^k$ times on input n, h. Pick n, h such that

$$d = D(n) > (nh)^k / (1 - 2/h).$$

By Lemma 1, there exists a set S of D(n) examples that is shattered by $F_n$. Place the uniform probability
distribution

P(x) = 1/d if $(x, y) \in S$
= 0 otherwise

on S and run A on it. Now, on any $m = (nh)^k$ calls of EXAMPLE, A will see at most m elements of S. Let $S_1$
be the set of examples seen. Let g be the function output by A and let f be the function to be learned.
Since S is shattered by $F_n$, there are at least $2^{d-m}$ possibilities for f that are consistent with the examples
seen by A. On each element of $(S - S_1)$, g will differ with at least half the possibilities for f. Therefore, the
total number of differences over all the possibilities for f is at least $2^{d-m}(d-m)/2$, and the average is
(d-m)/2. This average must be attained or exceeded on at least one possibility for f. Hence, there exists
a function f for which the function g output by A always differs from f on at least (d-m)/2 of the elements of
S. The probabilistic weight of this difference is

$$(d-m)/2d > 1/2 - (1 - 2/h)/2 \ge 1/h,$$

which is more than the allowable. Hence A does not learn F.
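
The averaging step can be checked numerically in a simplified setting. The sketch below assumes binary labels on the unseen points, which is a simplification of the string-valued case in the text, and the names `total_disagreements` and `error_weight` are hypothetical.

```python
from fractions import Fraction
from itertools import product
from typing import Sequence


def total_disagreements(g_labels: Sequence[int]) -> int:
    """Sum, over every binary labelling of the unseen points, of the number
    of points on which that labelling differs from the learner's guess."""
    k = len(g_labels)
    return sum(
        sum(a != b for a, b in zip(f_labels, g_labels))
        for f_labels in product((0, 1), repeat=k)
    )


def error_weight(d: int, m: int) -> Fraction:
    """Probability mass (d - m) / 2d of the disagreements under the uniform
    distribution on a shattered set of d examples, m of which were seen."""
    return Fraction(d - m, 2 * d)
```

Whatever the learner guesses on k unseen points, the total over all 2^k labellings is k * 2^(k-1), so the average is k/2; with k = d - m this is the (d-m)/2 used in the proof.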




This  completes  our  proof.  • 

Finally, we present a resource bounded version of Theorem 1. Theorem 2 concerns time
complexity, but other resource bounds may be treated similarly.


Theorem 2: A family of functions F is polynomial-time learnable if and only if F has a
polynomial-time ordering of polynomial width.

Proof: Straightforward extension of Theorem 1. •

Remarks  The  results  presented  in  [Blumer  et  al.  1986,  Natarajan  1987]  concern  learning  sets  from 
samples  of  their  elements.  It  is  easy  to  see  that  sets  are  encodable  as  boolean-valued  functions  and 
hence can be treated as a special case of our theorem. Conversely, a function from $\{0,1\}^n$ to $\{0,1\}^n$ can
be viewed as a combination of n boolean-valued functions on $\{0,1\}^n$, and hence learning functions can be
viewed as a special case of learning sets.

In  our  development,  we  used  a  discrete  metric  to  measure  the  distance  between  two  functions  on  an 
input string - two functions agreed on a string or did not. It is worth mentioning that our arguments carry
through  for  any  standard  metric. 

The  following  is  a  resource  bounded,  weak  form  of  Theorem  1 . 

Theorem 2: A family of functions F is polynomial-time learnable (1) if F has an ordering of
polynomial width computable in polynomial time, (2) only if F has an ordering of polynomial width
computable in random-polynomial time.

Proof: Straightforward extension of Theorem 1. •

4.2  Properties  of  the  Dimension 

For any family of functions F, let dim(F) denote the dimension of F. Let A and B be two families of
functions such that dim(A), dim(B) $\ge$ 1.

Lemma 2: If $C = A \cap B$, then dim(C) $\le$ min(dim(A), dim(B)).

Proof:  Immediate.  • 


Let A and B be two families from $X_1 \to Y_1$ and $X_2 \to Y_2$ respectively. Then $C = A \times B$ is the family of
functions from $X_1 \times X_2 \to Y_1 \times Y_2$ such that each function in C is the product of some two functions in A
and B, i.e.,

$$C = \{a \times b \mid a \in A, b \in B\},$$

where $a \times b$ is defined as follows:

For all $(x_1, x_2) \in X_1 \times X_2$,

$$(a \times b)(x_1, x_2) = (a(x_1), b(x_2)).$$

Lemma 3: If $C = A \times B$, then dim(C) $\le$ dim(A) $\times$ dim(B).

Proof: Straightforward. •

Lemma 4: If $C = A \cup B$ then dim(C) $\le$ max(dim(A), dim(B)) + 1.

Proof: Without loss of generality, let dim(A) $\ge$ dim(B). Combine the minimum width orderings $O_A$, $O_B$
for A and B to obtain an ordering $O_C$ for C as follows.

function $O_C$(S: set of examples)
begin
    if $|S| \le$ dim(A)
    then return $O_A(S)$
    else return $O_B(S)$
end

Clearly, $O_C$ is an ordering for C of width dim(A) + 1. •

Lemma 5: Let $A = \{a_1, a_2, \ldots, a_i, \ldots\}$ be a family of {0,1}-valued functions and let $\bar{A}$ be the family
$\{\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_i, \ldots\}$ where $\bar{a}_i = 1 - a_i$. Then, dim($\bar{A}$) = dim(A).

Proof:  Immediate.  • 

5.  Functions  over  Continuous  Spaces 
5.1  Learnability 

As  our  results  are  derived  using  information  theoretic  methods,  it  is  impossible  to  extend  them 
directly  to  continuous  spaces  where  each  example  can  be  of  infinite  length.  On  the  other  hand,  the  results 
in [Blumer et al. 1986] for learning boolean-valued functions are obtained using some classical results in
probability  theory  and  are  valid  over  continuous  spaces.  Hence,  we  will  concentrate  our  efforts  on 
extending  their  results  to  arbitrary  functions. 

As in [Blumer et al. 1986], we define the Vapnik-Chervonenkis dimension $d_{vc}(F)$ of a family F as
follows.

For any set of examples S, define the set $\Pi_F(S)$ as the set of all subsets of S obtained by intersecting
S with the functions in F, i.e.,

$$\Pi_F(S) = \{R \mid R \subseteq S \text{ and } \exists f \in F \text{ such that } f \text{ agrees with } S \text{ on } R \text{ and disagrees with } S \text{ on } S - R\}.$$

If $\Pi_F(S) = 2^S$, we say that F shatters S. $d_{vc}(F)$ is the smallest integer d such that no set of cardinality d+1 is
shattered by F.

Since  we  no  longer  need  the  notion  of  a  sub-family,  we  modify  our  definition  of  learnability 
accordingly. In particular, a family of functions F is learnable if there exists an algorithm that

(a)  takes  as  input  an  integer  h, 

(b)  makes  polynomially  many  calls  of  EXAMPLE,  polynomial  in  the  adjustable  error  parameter  h. 

(c)  as  in  the  earlier  definition  of  learnability. 

With  these  definitions  in  hand,  we  can  state  the  following  theorem. 

Theorem 3: For any finite alphabet $\Sigma$, a family of functions from $\Sigma^*$ to $\Sigma^*$ is learnable if and only if it is
of finite Vapnik-Chervonenkis dimension.


Proof:  The  proof  of  this  theorem  is  similar  to  the  proof  of  the  corresponding  theorem  for  boolean 
valued  functions  [Blumer  et  al.  1986].  • 

5.2  Properties  of  the  Dimension 

Lemmas  2,  3,  and  5  stand  in  their  present  form  for  the  Vapnik-Chervonenkis  dimension  as  well. 
Lemma  4  needs  to  be  rewritten  as  follows. 

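Lemma 4': If $C = A \cup B$ then $d_{vc}(C) \le d_{vc}(A) + d_{vc}(B)$.

Proof: Let $d_{vc}(A) = d_A$ and $d_{vc}(B) = d_B$. Let S be any set of examples such that $|S| > d_A + d_B$. Since
$C = A \cup B$,

$$\Pi_C(S) = \Pi_A(S) \cup \Pi_B(S).$$

Hence,

$$|\Pi_C(S)| \le |\Pi_A(S)| + |\Pi_B(S)|.$$

By Lemma 1 of [Vapnik and Chervonenkis 1971],

$$|\Pi_A(S)| \le \sum_{i=0}^{d_A} \binom{|S|}{i}$$

and

$$|\Pi_B(S)| \le \sum_{i=0}^{d_B} \binom{|S|}{i}.$$

Hence

$$|\Pi_C(S)| \le \sum_{i=0}^{d_A} \binom{|S|}{i} + \sum_{i=0}^{d_B} \binom{|S|}{i}
\le \sum_{i=0}^{d_A} \binom{|S|}{i} + \sum_{i=|S|-d_B}^{|S|} \binom{|S|}{i}
< 2^{|S|}.$$

Hence C cannot shatter S if $|S| > d_A + d_B$, implying that $d_{vc}(C) \le d_A + d_B$ as claimed. •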

6. Two Familiar Function Families

We now turn our attention to two familiar function families - the regular functions and the polynomial-time
computable functions. Our interest here is to construct learning algorithms for these families. Since
these families are of exponential dimension, we modify our definition of learnability to be meaningful in
this context. The motivation behind our definition is as follows. Suppose that we are trying to learn an
unknown function from examples and are told only that the function is regular (or computable in
polynomial time) and is accepted by a deterministic finite automaton of d states (has an encoding of
length d). Is this information sufficient to enable us to efficiently learn the function?

Let F be a family of functions with a measure on the size of the encodings for each function in the
family. For any integer d, let $f_1^d, f_2^d, \ldots$ be the functions in F of size d. Then, for any n, the
nth-subfamily $F_n^d$ of F with respect to d is the set of functions $g_1, g_2, \ldots$ where

$g_i(x) = f_i^d(x)$ if $|x| = n$,

= undefined otherwise.

The family F is learnable if there exists an algorithm A that

(a) takes as input: problem size n, error parameter h and output size d.

(b) runs in time polynomial in n, h, d. EXAMPLE provides examples for some function in $F_n^d$.

(c) for all functions f in $F_n^d$ and all probability distributions P on $\Sigma^n$, with probability (1 - 1/h) the
algorithm outputs a function g in $F_n^d$ such that

$$\sum_{x \in S} P(x) \le \frac{1}{h},$$

where $S = \{x \mid |x| = n \text{ and } f(x) \ne g(x)\}$.

We  say  that  A  learns  F. 

From Theorem 2 we know that in order to construct a learning algorithm for F in the above sense, we
only need construct an efficient ordering for F, i.e., given a set S of examples for some function in $F_n^d$, we
should be able to efficiently compute a function in $F_n^d$ consistent with S.

6.1  Regular  Functions 

We extend the notion of regular sets to that of regular functions, by considering Mealy machines
[Hopcroft & Ullman 1979] instead of accept/reject finite automata. Specifically, we associate a character
of the alphabet with each transition of the automaton and this character is output each time that transition is
completed. The function value for a string is the output obtained by running the automaton on the string.
Our regular functions are from $\Sigma^*$ to $\Sigma^*$ and are length preserving.
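
A Mealy machine of this kind is easy to simulate. The Python sketch below assumes a transition table keyed by (state, input symbol) with values (next state, output symbol) and start state 0; the names `run_mealy` and `DELAY` are hypothetical, and the delay machine is just one illustrative regular function.

```python
from typing import Dict, Tuple

# (state, input symbol) -> (next state, output symbol)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def run_mealy(delta: Transitions, x: str, start: int = 0) -> str:
    """Run the machine on x, emitting one output symbol per transition."""
    state, out = start, []
    for symbol in x:
        state, emitted = delta[(state, symbol)]
        out.append(emitted)
    return "".join(out)


# A two-state machine that outputs the previous input bit (0 for the first
# position); the state simply remembers the last bit read.
DELAY: Transitions = {
    (0, "0"): (0, "0"),
    (0, "1"): (1, "0"),
    (1, "0"): (0, "1"),
    (1, "1"): (1, "1"),
}
```

Since exactly one symbol is emitted per input symbol, every function computed this way is length preserving.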

We now consider the issue of efficiently ordering the regular functions. Define the encoding size of a
regular function to be the size of the minimal automaton that computes the function. We need to answer
the following question: given a set of examples S and an integer d, find a deterministic finite automaton of
size d, consistent with S. This is equivalent to finding the minimal deterministic finite automaton consistent
with the given set of examples. Unfortunately, this problem is NP-complete as shown by [Gold 1978,
Angluin 1978]. Consequently, we conclude that it is unlikely that the regular functions are learnable, as
claimed below.

Claim: If the regular functions are learnable as defined above, then NP = RP.

Proof:  If  the  regular  functions  were  leamable,  then  we  could  order  them  in  random  polynomial  time. 
But,  as  reported  in  [Gold  1978],  ordering  the  regular  functions  is  an  NP-complete  problem.  Hence  the 
claim.  • 
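
To make the ordering problem concrete, the following Python sketch searches exhaustively for a d-state machine consistent with a set of examples. It is an illustration only: the names `run` and `find_machine` are hypothetical, the alphabet is fixed to binary, and the search examines up to (2d)^(2d) transition tables, so it says nothing about efficient algorithms.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple

ALPHABET = "01"
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def run(delta: Transitions, x: str) -> str:
    """Simulate a Mealy machine from state 0, one output per input symbol."""
    state, out = 0, []
    for symbol in x:
        state, emitted = delta[(state, symbol)]
        out.append(emitted)
    return "".join(out)


def find_machine(examples: Sequence[Tuple[str, str]], d: int) -> Optional[Transitions]:
    """Return some d-state machine consistent with the examples, or None."""
    keys = [(q, s) for q in range(d) for s in ALPHABET]
    choices = [(q, o) for q in range(d) for o in ALPHABET]
    # Try every assignment of (next state, output) to every (state, symbol).
    for values in product(choices, repeat=len(keys)):
        delta = dict(zip(keys, values))
        if all(run(delta, x) == y for x, y in examples):
            return delta
    return None
```

Calling `find_machine` with d = 1, 2, ... in turn yields a minimal consistent machine, at a cost that grows exponentially with d.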

6.2  The  Polynomial-time  Computable  Functions 

We  consider  the  family  of  all  length  preserving,  polynomial-time  computable  functions.  To  develop 
some tools for our arguments, we first look at the family of all computable functions.


Consider the problem of ordering the computable functions. Let the encoding size of a function be
the size of the shortest program computing the function in some admissible programming system, say the
Turing machine system. We need to be able to compute: given a set of examples S and an integer d,
find a program of size d consistent with S - a problem that is equivalent to computing the minimal program
consistent with S. This leads us naturally to the notion of the constraint complexity C(S) of a set S of
examples.

$$C(S) = \min \{ d \mid \exists z, |z| = d \text{ and } \forall (x, y) \in S: M_u(z, x) = y \},$$

where $M_u$ is the universal program. In words, C(S) is the size of the shortest program consistent with S.
Contrast this with the definition of the Kolmogorov complexity of a string x [Hartmanis 1983].

$$K(x) = \min \{ d \mid \exists z, |z| = d \text{ and } M_u(z) = x \}.$$

If S is a set of examples for a function f, C(S) aims at measuring the amount of information about f carried
by S. This is brought out in the following propositions.

Proposition 1: For any string x

$$C(\{(0^m, x)\}) \le K(x) \le C(\{(0^m, x)\}) + \log m.$$

Proposition 2: If S is a set of examples for a program p, then,

$C(S) \le |p|$
$C(S) \le K(p) + c$
where c is a small constant.

Proposition 2 tells us that the information carried by a set of examples is bounded by the shortest
description for the program generating the examples. Extend the notation C(S) to C(f) where f is a
function, as follows: C(f) is the length of the smallest program consistent with any set of examples for f,
i.e., C(f) is the length of the shortest program computing f.

Proposition 3: Let f, g be two functions on $\Sigma^n$. Then f and g differ on at least $|C(f) - C(g)|/2n$ strings.

Proof: Without loss of generality, let $C(f) \le C(g)$ and let f and g differ on fewer than $(C(g) - C(f))/2n$
strings. If $p_f$ is the minimal program for f, construct a program $p_g$ for g by simply tagging on a table of
differences to $p_f$. Each entry of the table is a pair of strings of length n, so the length of this tag is less
than $2n(C(g) - C(f))/2n = C(g) - C(f)$ and hence $p_g$ is a program for g that is shorter than C(g).
A contradiction and hence the proposition. •

To  illustrate  the  power  of  the  notion  of  constraint  complexity,  we  prove  the  following  version  of  the 
only if part of Theorem 1.

Theorem 1': Let F be a family of functions and let $F_n$ be of dimension D(n). Then, no algorithm
that calls EXAMPLE T(n) times, where lim nT(n)/D(n) = 0, can learn F.

Proof: Let A be a learning algorithm for F calling EXAMPLE T(n) times, where lim nT(n)/D(n) = 0.
Pick n such that $2nT(n) + |A| \ll D(n)$. Since $F_n$ is of dimension D(n), $|F_n| \ge 2^{D(n)}$ and hence there exists a
function f in $F_n$ such that $C(f) \ge D(n) \gg T(n)$. But, for any set S of T(n) examples for f, $C(S) \le 2nT(n)$ and
hence any function g output by A is such that $C(g) \le C(S) + |A| \le 2nT(n) + |A| \ll D(n)$. By Proposition 3, g
differs from f on too many strings, and hence A cannot learn F for the uniform distribution on $\Sigma^n$. •

As  the  reader  might  expect,  the  constraint  complexity  of  sets  of  examples  is  badly  noncomputable, 
displaying  many  of  the  strong  properties  of  Kolmogorov  complexity. 

Proposition 4: The set {S | S is a set of examples and $C(S) \ge |S|/2$} is immune, i.e., there exists no
computable set that enjoys an infinite intersection with the above set.

Proof:  Similar  to  the  corresponding  result  for  Kolmogorov  complexity.  See  [Natarajan  1985]  for 
example.  • 

Returning to the realm of polynomial-time computable functions, we introduce the time-bounded
constraint complexity. For a set of examples S and time bound T(n),

$$C^{T(n)}(S) = \min \{ d \mid \exists z, |z| = d \text{ and } \forall (x, y) \in S: M_u^{T(n)}(z, x) = y \},$$

where $M_u^{T(n)}$ is the T(n) time bounded universal program. Hence, to order the functions computable in
T(n) time, we need to be able to compute $C^{T(n)}(S)$ for any set S of examples. Unfortunately, the best
algorithm known is the following.

Proposition 5: $C^{T(n)}(S)$ is computable in non-deterministic time $|S| \, T(|S|)$.

Proof: Since $C(S) \le (|S| + c)$ for some constant c, simply guess a string of that length and verify
consistency with S. •

If T(n) were a polynomial in n, $C^{T(n)}(S)$ would be computable in non-deterministic polynomial time, NP. As
argued below, we do not know if we can push it into random polynomial time RP, or deterministic
polynomial time P.

Proposition 6: If T(n) is a polynomial, $C^{T(n)}$ is computable in NP, but not known to be in P or RP.

Proof:  From  Proposition  1  and  the  fact  that  it  is  not  known  whether  polynomial-time  Kolmogorov 
complexity  is  in  RP  or  P.  • 

In the light of the above, we cannot give a deterministic polynomial-time ordering for the polynomial-time
computable functions. In fact, we cannot even offer a randomized polynomial-time algorithm.
Consequently, we cannot give a deterministic polynomial-time learning algorithm for the polynomial-time
computable functions. Indeed, it seems unlikely that such an algorithm exists. We can, however, give a
non-deterministic polynomial-time algorithm as follows.


Algorithm 2

Input: problem size n, error parameter h,
       output size d and time bound $n^k$.
begin
    Call EXAMPLE dh times. Guess a string z of length d
    and verify that $M_u^{n^k}(z, \cdot)$ is consistent with the
    examples seen.
    If so, output the string.
end
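
The guess-and-verify step can be simulated deterministically by trying every string of length d, at exponential cost. The Python sketch below does this for a toy interpreter: `toy_run`, which XORs the input with the program repeated cyclically, is a hypothetical stand-in for the time bounded universal program and always halts in linear time; `find_program` and `constraint_complexity` are likewise illustrative names.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def toy_run(z: str, x: str) -> str:
    """Toy stand-in for the universal program: XOR x with z repeated cyclically."""
    return "".join(str(int(b) ^ int(z[i % len(z)])) for i, b in enumerate(x))


def find_program(examples: Sequence[Tuple[str, str]], d: int) -> Optional[str]:
    """Deterministic replacement for 'guess a string of length d and verify'."""
    for bits in product("01", repeat=d):
        z = "".join(bits)
        if all(toy_run(z, x) == y for x, y in examples):
            return z
    return None


def constraint_complexity(examples: Sequence[Tuple[str, str]], max_d: int) -> Optional[int]:
    """Smallest d <= max_d admitting a consistent program, relative to toy_run."""
    for d in range(1, max_d + 1):
        if find_program(examples, d) is not None:
            return d
    return None
```

The loop over all 2^d candidate programs is exactly what the non-deterministic guess avoids; no claim is made that this toy interpreter reflects the behaviour of a genuine universal program.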


7.  Conclusion 

This paper concerns algorithms that learn functions from examples. We considered length
preserving functions on strings of a finite alphabet and defined the notion of dimensionality for families of
such functions. Using this notion, we proved a general theorem that identifies the conditions under which
a family of such functions can be efficiently learned. This theorem was extended to functions on
continuous spaces by generalizing the notion of the Vapnik-Chervonenkis dimension introduced in
[Blumer et al. 1986]. We then considered the families of regular functions and the polynomial-time
computable functions. We showed that efficient algorithms for learning the regular functions do not exist
unless NP = RP. We also argued that it is unlikely that efficient algorithms exist for the polynomial-time
computable functions. In doing so, we introduced the notion of the constraint complexity of a set of
examples, a notion that is not only intuitively pleasing, but a useful tool as well.